Noor A Tanjum Saba Amin (241000161)
A N M Zahid Hossain
(241002061)
Kazi Nabila Tasnim (241001961)
Upam Chowdhury
(241001161)
Mohammad Shafiur Rahman (241000661)
The healthcare insurance industry operates in a complex environment where financial stability hinges on accurately forecasting medical expenses while generating sufficient revenue from premiums. The inherent challenge lies in predicting these costs, which are often volatile due to the unpredictable nature of rare and high-cost medical conditions. This report presents a comprehensive analysis aimed at predicting individual insurance costs using a dataset that encompasses various personal health-related factors. By leveraging this predictive capability, we seek to identify the most significant variables impacting insurance costs, thereby enabling the development of more precise actuarial tables for premium adjustments.
Our analysis delves into multiple dimensions of the data, beginning with an exploration of descriptive statistics, which sets the stage for a deeper understanding of the dataset’s structure. We consider both numeric and categorical features, such as age, Body Mass Index (BMI), smoking habits, gender, region, and the number of children. Each of these variables potentially influences medical expenses in unique ways. For instance, smoking is widely recognized as a major health risk factor, potentially leading to higher medical costs.
A key aspect of our analysis is the visualization of the distribution of medical charges. We employ histogram plots created using the ggplot2 library and enhance them with the interactivity of the plotly library. This approach not only provides a clear view of the data distribution but also allows for a more engaging exploration of individual data points. The histogram reveals that the distribution of medical charges is right-skewed, indicating that while most policyholders incur relatively low charges, a small subset experiences significantly higher costs.
We further investigate the relationship between medical charges and
various categorical variables through a series of boxplots. These
visualizations include:
Boxplot of Medical Charges by Sex: This plot highlights any
gender-based disparities in medical costs, providing insights into
whether one gender consistently incurs higher charges.
Boxplot of Medical Charges by Region: By comparing different regions, we can identify geographical variations in healthcare expenses, which might reflect differences in regional healthcare practices or accessibility.
Boxplot of Medical Charges by Number of Children: This analysis explores how having children influences medical costs, considering that family healthcare needs can vary significantly.
Boxplot of Medical Charges by Smoking Status: Smoking is a critical variable, and this plot typically shows that smokers have higher medical charges compared to non-smokers.
To deepen our understanding, we examine the distribution of medical charges for smokers versus non-smokers. This comparison is visualized through overlaid histograms or density plots, each category distinguished by different colors. The interactive features of the plotly library are employed here as well, allowing users to hover over data points for precise values, thereby facilitating a more nuanced analysis.
Additionally, we analyze medical charges against continuous variables such as age, BMI, and the number of children, with separate scatter plots for smokers and non-smokers. These plots enable us to observe trends and patterns, such as how charges escalate with age or higher BMI, and how these trends differ between smokers and non-smokers. For instance, we often find that smokers exhibit a steeper increase in medical charges with age compared to non-smokers.
A correlation heatmap provides a comprehensive view of the relationships between all features. By examining the correlation coefficients, we can identify which variables are closely related and potentially redundant. This step is crucial for ensuring that our regression model is not adversely affected by multicollinearity, which can skew results and reduce model accuracy.
The core of our predictive analysis involves building a linear regression model to forecast medical charges. Using the Akaike Information Criterion (AIC) for model selection, we aim to identify a model that balances goodness-of-fit with model complexity. The AIC helps in choosing a model that is neither underfitted nor overfitted, thereby ensuring robust and reliable predictions.
Once the model is trained, we evaluate its performance using several metrics, including R-squared, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). These metrics provide a comprehensive assessment of the model’s predictive accuracy and its ability to generalize to new data. Further, we validate the model assumptions—linearity, independence, homoscedasticity, and normality of residuals—using diagnostic plots. Ensuring these assumptions hold true is critical for the validity of our regression model.
Finally, we apply the model to make predictions on new data and assess its performance. This step not only demonstrates the practical utility of our model but also provides insights into potential areas for refinement. By evaluating the predictions against actual outcomes, we can continuously improve our model, ensuring it remains accurate and relevant in changing healthcare landscapes.
Overall, this report offers a detailed examination of medical insurance cost prediction using a multifaceted approach. By integrating descriptive statistics, exploratory data analysis, advanced visualizations, and robust regression modeling, we provide a comprehensive framework for understanding and forecasting medical expenses. The insights gained from this analysis not only enhance our understanding of the factors driving medical costs but also equip the health insurance company with the tools needed to make informed decisions on premium adjustments, ultimately contributing to financial stability and improved healthcare management.
In this project, the dataset is already separated randomly into train
and test dataset.
The features in the dataset are:
Since we are predicting insurance costs, charges will be our target
feature.
Step 01: Check the current data types:
This will give an overview of the current data types for each column.
# Load the dataset
train <- read.csv("C:/Users/chowd/Desktop/train.csv")
# Check data types of each feature
str(train)
## 'data.frame': 1070 obs. of 7 variables:
## $ age : int 37 18 23 32 58 25 36 34 53 45 ...
## $ sex : chr "male" "male" "female" "male" ...
## $ bmi : num 34.1 34.4 36.7 35.2 32.4 ...
## $ children: int 4 0 2 2 1 2 1 1 0 5 ...
## $ smoker : chr "yes" "no" "yes" "no" ...
## $ region : chr "southwest" "southeast" "northeast" "southwest" ...
## $ charges : num 40182 1137 38512 4671 13019 ...
Step 02: Set the correct data types:
Based on the information, here are the appropriate data types:
age: Correctly identified as int.
sex: Convert from chr to
factor.
bmi: Correctly identified as num.
children: Correctly
identified as int.
smoker: Convert from chr to factor.
region:
Convert from chr to factor.
charges: Correctly identified as
num.
Convert sex, smoker, and region to factors:
# Convert sex, smoker, and region to factors
train$sex <- as.factor(train$sex)
train$children <- as.factor(train$children)
train$smoker <- as.factor(train$smoker)
train$region <- as.factor(train$region)
Step 03: Confirm the Changes:
# Confirm change
str(train)
## 'data.frame': 1070 obs. of 7 variables:
## $ age : int 37 18 23 32 58 25 36 34 53 45 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 2 1 1 2 1 2 2 ...
## $ bmi : num 34.1 34.4 36.7 35.2 32.4 ...
## $ children: Factor w/ 6 levels "0","1","2","3",..: 5 1 3 3 2 3 2 2 1 6 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 2 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 1 4 1 2 3 1 3 3 ...
## $ charges : num 40182 1137 38512 4671 13019 ...
We have used the duplicated() function to identify any duplicate row. To remove them !duplicated() function is used.
# Check for duplicated rows
duplicated_rows <- duplicated(train)
# Drop duplicated rows
train <- train[!duplicated_rows, ]
# Verify that duplicates have been removed
print(sum(duplicated(train)))
## [1] 0
# Check for missing values
missing_values <- sapply(train, function(x) sum(is.na(x)))
missing_values
## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
The result indicates that there are no missing values in any of the columns in this data set.
To show descriptive statistics of the training dataset we have used the summary() function.
# Load necessary libraries for EDA
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.3
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# Descriptive statistics
summary(train)
## age sex bmi children smoker
## Min. :18.00 female:543 Min. :15.96 0:466 no :850
## 1st Qu.:26.00 male :526 1st Qu.:26.32 1:258 yes:219
## Median :39.00 Median :30.40 2:191
## Mean :39.11 Mean :30.73 3:118
## 3rd Qu.:51.00 3rd Qu.:34.80 4: 19
## Max. :64.00 Max. :53.13 5: 17
## region charges
## northeast:249 Min. : 1122
## northwest:253 1st Qu.: 4747
## southeast:294 Median : 9447
## southwest:273 Mean :13212
## 3rd Qu.:16587
## Max. :63770
Explanation of Result
Numeric Features
Age: The age of individuals in the dataset ranges from a
minimum of 18 to a maximum of 64 years. The average age is approximately
39.11 years, with a median of 39 years. The distribution shows that 25%
of the individuals are 26 years old or younger, and 75% are 51 years old
or younger, indicating a fairly even age distribution across
adulthood.
BMI (Body Mass Index): The BMI values range from a minimum of
15.96 to a maximum of 53.13. The average BMI is approximately 30.73,
with a median of 30.40. The first quartile (25%) of individuals have a
BMI of 26.32 or lower, while the third quartile (75%) have a BMI of
34.80 or lower. This suggests that the dataset includes a wide range of
BMI values, from underweight to obese categories.
Children: The dataset is divided into six categories in terms
of the number of children of the insurance contributors: No Children
(466), one child (258), two children (191), three children (118), four
children (18), and five children (17). The majority of the dataset
consists of no children which makes the distribution right-skewed and
may influence the analysis of medical charges.
Charges: Insurance charges in the dataset vary widely, with a
minimum of $1122 and a maximum of $63770. The average charge is
approximately $13212, with a median of 9447. The first quartile (25%) of
charges are $4747 or lower, and the third quartile (75%) are $16587 or
lower, reflecting a significant variation in insurance costs among
individuals.
Categorical Features
Sex: The dataset includes 543 females and 526 males,
indicating a relatively balanced distribution between the two sexes.
This balance is important for ensuring that any analyses or models
developed from the dataset are not biased towards one sex.
Smoker: Among the individuals, 850 are non-smokers and 219 are
smokers. The majority of the dataset consists of non-smokers, which may
influence the analysis of health-related costs and outcomes, given the
known health risks associated with smoking.
Region: The dataset is divided into four regions: northeast
(249 individuals), northwest (253 individuals), southeast (294
individuals), and southwest (273 individuals). This fairly even
distribution across regions ensures that geographic factors can be
adequately analyzed and compared.
Summary
The dataset provides a comprehensive overview of both numeric and categorical features, revealing important insights into the characteristics of the individuals included. The age, BMI, children, and charges variables exhibit typical distribution patterns with no extreme outliers, while the sex, smoker status, and region variables show balanced distributions that are crucial for unbiased analysis. This well-rounded dataset is thus suitable for various analytical and modeling tasks aimed at understanding the relationships between these features.
To show the distribution of ‘Charges’ we have created a histogram
plot using the ggplot2 library. Additionally, we have the plotly library
to create an interactive version of the plot that displays the exact
data or output when hovering over the plot.
# Loading necessary library
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Create a histogram plot of 'Charges'
charge_plot <- ggplot(train, aes(x = charges)) +
geom_histogram(binwidth = 1000, fill = "skyblue", color = "black") +
labs(title = "Distribution of Charges",
x = "Charges",
y = "Frequency") +
theme_minimal()
# Convert ggplot to plotly for interactive features
charge_plot_interactive <- ggplotly(charge_plot, tooltip = c("x", "y"))
# Print the plot
charge_plot_interactive
Explanation of Result
The plot above displays the distribution of charges in the data set.
The x-axis represents the charges incurred, while the y-axis represents
the frequency of charges falling within each bin (1000).
X-axis (Charges): The x-axis shows the range of charges
incurred by individuals in the data set. Each bar represents a specific
range of charges, with the width of the bars determined by the
bandwidth.
Y-axis (Frequency): The y-axis represents the frequency of
charges falling within each bin. A higher bar indicates a higher
frequency of individuals with charges within that range.
From the Histogram we can get reconfirmation that although the data
is largely concentrated at the lower end significant variation is
present in medical charges.
Hovering over each bar in the plot will display additional
information, such as the exact frequency of charges or other relevant
data points associated with that particular bin, allowing for a more
detailed exploration of the distribution of charges in the data
set.
i. Boxplot of Medical Charges as per sex.
ii. Boxplot of
Medical Charges as per region.
iii. Boxplot of Medical Charges as
per children.
iv. Boxplot of Medical Charges as per smoker.
# Loading necessary library
library(ggplot2)
# Boxplots
features <- c('sex', 'region', 'children', 'smoker')
for (feature in features) {
p <- ggplot(train, aes_string(x = feature, y = 'charges')) +
geom_boxplot() +
theme_minimal() +
labs(title = paste('Boxplot of Charges by', feature), x = feature, y = 'Charges')
# Print the plot
print(p)
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Explanation of Result
i. Boxplot of Medical Charges as per sex:
The boxplot shows the distribution of medical charges for males and
females. There seems to be some variation in charges between males and
females, with a slightly higher median charge for males compared to
females. However, further statistical analysis would be needed to
determine if this difference is significant.
ii. Boxplot of Medical Charges as per region:
The boxplot displays the distribution of medical charges across
different regions. There appears to be some variation in charges among
the four regions. For example, the median charge in the northeast region
seems to be higher compared to the other regions. However, outliers are
present in all regions, indicating that there may be other factors
influencing the charges.
iii. Boxplot of Medical Charges as per children:
The boxplot illustrates the distribution of medical charges based on
the number of children an individual has. There seems to be some
variation in charges across different numbers of children, with
individuals having more children generally having higher medical
charges. However, the difference in charges between individuals with
different numbers of children is not substantial.
iv. Boxplot of Medical Charges as per smoker:
The boxplot shows the distribution of medical charges for smokers and
non-smokers. There is a clear difference in charges between smokers and
non-smokers, with smokers generally having higher medical charges
compared to non-smokers. This suggests that smoking may be a significant
factor contributing to higher medical costs.
In summary, these boxplots provide insights into the relationship
between medical charges and various features such as sex, region, No. of
children, and smoking status. They reveal potential patterns and
differences in charges across different categories of these features,
which can be further explored and analyzed in detail.
To create a distribution of ‘charges’ categorizing it into smoker and non-smoker with two separate colors for each category, and to add interactive features showing the output when hovering over the plot, we used ggplot2 for creating the plot and then converted it to a plotly object for interactivity.
# Load required libraries
library(ggplot2)
library(plotly)
# Load the dataset
train <- read.csv("C:/Users/chowd/Desktop/train.csv")
# Create ggplot with separate colors for smoker and non-smoker
charge_distribution <- ggplot(train, aes(x = charges, fill = smoker)) +
geom_histogram(binwidth = 1000, color = "black") +
scale_fill_manual(values = c("skyblue", "salmon"), labels = c("Non-Smoker", "Smoker")) +
labs(title = "Distribution of Charges by Smoking Status",
x = "Charges",
y = "Frequency") +
theme_minimal()
# Convert ggplot to plotly for interactive features
charge_distribution_interactive <- ggplotly(charge_distribution, tooltip = "all")
# Add x and y axis titles
charge_distribution_interactive <- charge_distribution_interactive %>%
layout(xaxis = list(title = "Charges"),
yaxis = list(title = "Frequency"))
# Print the plot
charge_distribution_interactive
Explanation of Result
The histogram plot illustrates the distribution of medical charges
categorized by smoking status—specifically, into smokers and
non-smokers. Each bar in the plot represents a range of medical charges,
with distinct colors used to differentiate between smokers (depicted in
salmon) and non-smokers (shown in skyblue). The x-axis delineates the
charges incurred, while the y-axis quantifies the frequency of charges
within each category. Hovering over any bar in the plot reveals
additional details, such as the precise frequency of charges or other
relevant data points associated with that particular bin (1000).
The histogram clearly indicates medical charges for smokers are on the higher side.
To analyze medical charges by age, BMI, and children according to the
smoker factor, we used scatter plots. Each scatter plot has
medical charges on the y-axis and the respective feature (age, BMI, or
children) on the x-axis, with separate points representing smokers and
non-smokers.
# Load required libraries
library(ggplot2)
library(plotly)
# Load the dataset
train <- read.csv("C:/Users/chowd/Desktop/train.csv")
# Create scatter plots for medical charges by age, BMI, and children, categorized by smoker factor
charge_age_plot <- ggplot(train, aes(x = age, y = charges, color = smoker)) +
geom_point() +
labs(title = "Medical Charges by Age and Smoker Status",
x = "Age",
y = "Medical Charges") +
scale_color_manual(values = c("skyblue", "salmon"), labels = c("Non-Smoker", "Smoker")) +
theme_minimal()
charge_bmi_plot <- ggplot(train, aes(x = bmi, y = charges, color = smoker)) +
geom_point() +
labs(title = "Medical Charges by BMI and Smoker Status",
x = "BMI",
y = "Medical Charges") +
scale_color_manual(values = c("skyblue", "salmon"), labels = c("Non-Smoker", "Smoker")) +
theme_minimal()
charge_children_plot <- ggplot(train, aes(x = children, y = charges, color = smoker)) +
geom_point() +
labs(title = "Medical Charges by Children and Smoker Status",
x = "Children",
y = "Medical Charges") +
scale_color_manual(values = c("skyblue", "salmon"), labels = c("Non-Smoker", "Smoker")) +
theme_minimal()
# Convert ggplot to plotly for interactive features
charge_age_plot_interactive <- ggplotly(charge_age_plot, tooltip = "all") %>%
layout(xaxis = list(title = "Age"),
yaxis = list(title = "Medical Charges"))
charge_bmi_plot_interactive <- ggplotly(charge_bmi_plot, tooltip = "all") %>%
layout(xaxis = list(title = "BMI"),
yaxis = list(title = "Medical Charges"))
charge_children_plot_interactive <- ggplotly(charge_children_plot, tooltip = "all") %>%
layout(xaxis = list(title = "Children"),
yaxis = list(title = "Medical Charges"))
# Print the interactive scatter plots
charge_age_plot_interactive
charge_bmi_plot_interactive
charge_children_plot_interactive
Explanation of Result
Three scatter plots are generated to analyze the relationship between
medical charges and three key factors: age, BMI, and the number of
children, categorized by smoking status (smokers vs. non-smokers). Each
scatter plot depicts medical charges on the y-axis against the
respective feature (age, BMI, or children) on the x-axis. Individual
data points on the scatter plots represent specific observations from
the dataset, with distinct colors used to differentiate between smokers
(depicted in salmon) and non-smokers (shown in skyblue). Hovering over
any point in the plot reveals additional information, such as the exact
values of age, BMI, children, and medical charges for that particular
observation, providing users with detailed insights into each data
point.
From the scatter plots following facts could be observed:
1.
Irrespective of contributing factors smokers are likely to incur more
medical charges than non-smokers.
2. Medical charges increases with
age.
3. In cases of non-smokers BMI do not play a major role in
increasing of medical charges. However, for smokers charges are higher
and it significantly increases with age.
4. No. of Children does not
play any major contributing role for medical charges.
To create correlation heatmap according to the factors, we used ggplot & GGally
# Convert categorical variables to numerical for correlation calculation
train_numeric <- train
train_numeric$sex <- as.numeric(factor(train_numeric$sex))
train_numeric$smoker <- as.numeric(factor(train_numeric$smoker))
train_numeric$region <- as.numeric(factor(train_numeric$region))
# Compute the correlation matrix
correlation_matrix <- cor(train_numeric)
# Create the heatmap
library(ggplot2)
library(GGally)
ggcorr(correlation_matrix, label = TRUE)
The heat map that has been created shows positive correlation with
red and negative correlation in light blue.
From the scatter plots following facts could be observed:
1.
There is strong positive correlation (0.9) between Medical Charges and
Smoker Status
2. There is weak positive correlation (01) between Age
and Medical Charges
3. Interestingly, Age and BMI does not have any
correlation.
To implement linear regression with automatic feature selection using
backward elimination and the stepwise regression method in R, we used
the step() function. This function performs stepwise regression
by interactively adding or removing predictors from the model based on a
chosen criterion, such as the Akaike Information Criterion
(AIC).
Converting to relevant data type:
# Convert categorical variables to relevant data type for correlation calculation
train_return <- train
train$sex <- as.factor(train$sex)
train$smoker <- as.factor(train$smoker)
train$region <- as.factor(train$region)
# Load necessary library for linear regression
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
# Fit the full model
full_model <- lm(charges ~ ., data = train)
# Perform backward elimination
best_model <- step(full_model, direction = "backward")
## Start: AIC=18684.38
## charges ~ age + sex + bmi + children + smoker + region
##
## Df Sum of Sq RSS AIC
## - region 3 1.3613e+08 4.0477e+10 18682
## - sex 1 6.3567e+04 4.0341e+10 18682
## <none> 4.0341e+10 18684
## - children 1 2.9602e+08 4.0637e+10 18690
## - bmi 1 4.1317e+09 4.4472e+10 18787
## - age 1 1.3368e+10 5.3709e+10 18989
## - smoker 1 9.5658e+10 1.3600e+11 19983
##
## Step: AIC=18681.98
## charges ~ age + sex + bmi + children + smoker
##
## Df Sum of Sq RSS AIC
## - sex 1 1.4637e+05 4.0477e+10 18680
## <none> 4.0477e+10 18682
## - children 1 2.8926e+08 4.0766e+10 18688
## - bmi 1 4.1467e+09 4.4624e+10 18784
## - age 1 1.3536e+10 5.4013e+10 18989
## - smoker 1 9.6320e+10 1.3680e+11 19983
##
## Step: AIC=18679.98
## charges ~ age + bmi + children + smoker
##
## Df Sum of Sq RSS AIC
## <none> 4.0477e+10 18680
## - children 1 2.8912e+08 4.0766e+10 18686
## - bmi 1 4.1511e+09 4.4628e+10 18782
## - age 1 1.3545e+10 5.4022e+10 18987
## - smoker 1 9.6553e+10 1.3703e+11 19983
# Summary of the best model
summary(best_model)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11733 -2982 -1005 1354 29708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11910.54 1059.71 -11.239 < 2e-16 ***
## age 254.97 13.51 18.878 < 2e-16 ***
## bmi 320.62 30.68 10.451 < 2e-16 ***
## children 430.55 156.10 2.758 0.00591 **
## smokeryes 23587.56 467.98 50.403 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6165 on 1065 degrees of freedom
## Multiple R-squared: 0.7361, Adjusted R-squared: 0.7351
## F-statistic: 742.8 on 4 and 1065 DF, p-value: < 2.2e-16
# AIC of the final model
aic_value <- AIC(best_model)
cat("AIC of the final model:", aic_value, "\n")
## AIC of the final model: 21718.51
# Save the best model
saveRDS(best_model, file = "best_model.rds")
Explanation of Result
The output of the linear regression analysis provides valuable insights into the relationship between the dependent variable, ‘charges’, and the independent variables: ‘age’, ‘bmi’, ‘children’, and ‘smoker’. The call to the lm() function specifies the formula for the linear regression model and the dataset used, which is ‘train’. The residuals section offers summary statistics, including the minimum, 1st quartile, median, 3rd quartile, and maximum values of the residuals.
The coefficients table displays the estimated effect size of each predictor on the dependent variable. For instance, for every one-unit increase in ‘age’, the ‘charges’ are estimated to increase by $254.97, holding other variables constant. The significance of each predictor is indicated by the p-values, with values less than 0.05 suggesting statistical significance. In this case, all predictors are statistically significant.
The R-squared value of 0.7361 indicates that approximately 73.61% of the variance in ‘charges’ is explained by the predictors included in the model. Additionally, the adjusted R-squared accounts for the number of predictors and sample size in the model. The F-statistic assesses the overall significance of the model, with a significant p-value (< 0.05) indicating that the model as a whole is statistically significant in predicting ‘charges’.
Finally, the Akaike Information Criterion (AIC) of the final model is 21718.51. A lower AIC value suggests a better model fit relative to other models. Overall, this analysis provides insights into the predictive power of the selected predictors and the overall model fit in explaining the variation in ‘charges’.
To make predictions using the model selected in Step 01 and the training dataset, we used the predict() function in R. This function takes the trained model and the dataset as inputs and returns predicted values based on the model.
# Predict using the training dataset
predictions <- predict(best_model, newdata = train)
# Load necessary library for linear regression
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.3.3
# Load the dataset
test <- read.csv("C:/Users/chowd/Desktop/test.csv")
# Predict using the testing dataset
test_predictions <- predict.lm(best_model, newdata = test)
# Calculate MAE and RMSE
mae_value <- mae(test$charges, test_predictions)
rmse_value <- rmse(test$charges, test_predictions)
cat("Mean Absolute Error (MAE):", mae_value, "\n")
## Mean Absolute Error (MAE): 3941.069
cat("Root Mean Squared Error (RMSE):", rmse_value, "\n")
## Root Mean Squared Error (RMSE): 5672.011
Explanation of Result
The Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) values provide insights into the performance of the predictive model when comparing the charges predicted by the model with the actual charges in the testing dataset.
Mean Absolute Error (MAE): The MAE value of 3941.069 indicates that, on average, the predictions made by the model differ from the actual charges by approximately $3941.07. In other words, the average absolute deviation between the predicted charges and the actual charges is $3941.07. Lower MAE values signify that the model’s predictions are closer to the actual charges, reflecting better accuracy.
Root Mean Squared Error (RMSE): The RMSE value of 5672.011 represents the square root of the average of the squared errors between the predicted charges and the actual charges. It indicates that, on average, the predictions deviate from the actual charges by approximately $5672.01. RMSE penalizes larger errors more heavily compared to MAE. Similar to MAE, lower RMSE values signify better accuracy of the model’s predictions.
Interpreting these results, while the MAE and RMSE values provide insights into the overall performance of the model, it’s essential to consider them in the context of the problem domain and the range of charges in the dataset. In this case, with MAE and RMSE values of 3941.069 and 5672.011 respectively, we observe that the model’s predictions exhibit some level of deviation from the actual charges. Further refinement of the model or feature engineering may be necessary to improve prediction accuracy. Additionally, comparing these error metrics with previous models or benchmark values can provide additional context for assessing the model’s performance.
The output represents the summary of a linear regression model fitted to the training dataset, with the response variable being ‘charges’ and the predictor variables including ‘age’, ‘bmi’, ‘children’, and ‘smoker’.
# Load the dataset
test <- read.csv("C:/Users/chowd/Desktop/test.csv")
# Model summary
summary(best_model)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11733 -2982 -1005 1354 29708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11910.54 1059.71 -11.239 < 2e-16 ***
## age 254.97 13.51 18.878 < 2e-16 ***
## bmi 320.62 30.68 10.451 < 2e-16 ***
## children 430.55 156.10 2.758 0.00591 **
## smokeryes 23587.56 467.98 50.403 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6165 on 1065 degrees of freedom
## Multiple R-squared: 0.7361, Adjusted R-squared: 0.7351
## F-statistic: 742.8 on 4 and 1065 DF, p-value: < 2.2e-16
Explanation of Result
The output of the linear regression analysis provides valuable insights into the relationship between the dependent variable, ‘charges’, and the independent variables: ‘age’, ‘bmi’, ‘children’, and ‘smoker’. The call to the lm() function specifies the formula for the linear regression model and the dataset used, which is ‘train’. The residuals section offers summary statistics, including the minimum, 1st quartile, median, 3rd quartile, and maximum values of the residuals.
The coefficients table displays the estimated effect size of each predictor on the dependent variable. For instance, for every one-unit increase in ‘age’, the ‘charges’ are estimated to increase by $254.97, holding other variables constant. The significance of each predictor is indicated by the p-values, with values less than 0.05 suggesting statistical significance. In this case, all predictors are statistically significant.
The R-squared value of 0.7361 indicates that approximately 73.61% of the variance in ‘charges’ is explained by the predictors included in the model. Additionally, the adjusted R-squared accounts for the number of predictors and sample size in the model. The F-statistic assesses the overall significance of the model, with a significant p-value (< 0.05) indicating that the model as a whole is statistically significant in predicting ‘charges’.
Finally, the Akaike Information Criterion (AIC) of the final model is 21718.51. A lower AIC value suggests a better model fit relative to other models. Overall, this analysis provides insights into the predictive power of the selected predictors and the overall model fit in explaining the variation in ‘charges’.
We confirmed the validity of the linear model assumptions by cheeking linearity, residual normality, homoscedasticity and multicollinearity.
library(ggplot2)
library(car)
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
library(plotly)
# Testing Linearity
cor_test <- cor.test(train$charges, predict(best_model))
cor_test
##
## Pearson's product-moment correlation
##
## data: train$charges and predict(best_model)
## t = 54.585, df = 1068, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8413143 0.8730247
## sample estimates:
## cor
## 0.8579848
# Residual Normality
residuals <- residuals(best_model)
residual_hist <- ggplot(data.frame(residuals), aes(x = residuals)) +
geom_histogram(binwidth = 500, fill = 'skyblue', color = 'black', alpha = 0.7) +
theme_minimal() +
labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Density')
# Convert histogram to plotly object
residual_hist <- ggplotly(residual_hist)
# Checking Homoscedasticity
residuals_vs_fitted <- ggplot(data.frame(fitted.values(best_model), residuals), aes(x = fitted.values(best_model), y = residuals)) +
geom_point(alpha = 0.7) + # Add transparency for better visualization
geom_smooth(method = 'loess', se = FALSE, color = 'skyblue') + # Add LOESS smoother for trend line
theme_minimal() +
labs(title = 'Residuals vs Fitted Values', x = 'Fitted Values', y = 'Residuals')
# Convert residuals vs fitted plot to plotly object
residuals_vs_fitted <- ggplotly(residuals_vs_fitted)
## `geom_smooth()` using formula = 'y ~ x'
# Assessing Multicollinearity
library(car)
vif_values <- vif(best_model)
vif_values
## age bmi children smoker
## 1.018337 1.013160 1.004207 1.003663
residual_hist
residuals_vs_fitted
The output provided is the result of Pearson’s Product-moment correlation test. Here’s a detailed interpretation of each part of the output:
Pearson’s product-moment correlation
1.Test Description
data: This specifies the two set of variables that were analyzed:
- train$charges: This represents the actual charges (or
true values) in the training dataset.
-
predict(best_model): This represents the predicted charges
generated by the best model.
2.Test Statistics
3.Alternative Hypothesis
train$charges and
predict(best_model) is not zero.4.Confidence Interval
5.Sample Estimates
train$charges and predict(best_model). A
correlation coefficient of 0.858 (rounded) suggests a very strong
positive linear relationship.Interpretation
Strong Positive Correlation: The correlation coefficient (0.858)
indicates a very strong positive linear relationship between the actual
charges and the predicted charges from the best model. This means that
as the actual charges increase, the predicted charges also tend to
increase proportionally.
Statistical Significance: The p-value (< 2.2e-16) is much
smaller than any conventional significance level (e.g., 0.05),
indicating that the correlation is statistically significant. There is
very strong evidence against the null hypothesis of no
correlation.
Confidence Interval: The 95% confidence interval (0.841 to 0.873)
suggests that we can be 95% confident that the true correlation
coefficient lies within this range. This reinforces the finding of a
strong positive correlation.
In summary, the result shows a highly significant and strong positive
correlation between the actual and predicted charges, indicating that
the best model is very effective in predicting the charges.
Residual Normality
By observing the residual histogram we
may conclude that the distribution is slightly right-skewed. This
implies that the model tends to overestimate the dependent variable
“Charges” slightly. (add a trend line)
Heteroschendasticity in data refers to a situation where the
variability of the residuals (the differences between observed and
predicted values) is not constant across all levels of the independent
variable(s). Instead, the spread of the residuals changes as the values
of the independent variable(s) change.
Checking the best model for
heteroschendasticity by plotting residuals we have come to decision that
the model heteroschendastic.
Multicollinearity
1. Criteria:
Threshold for identifying multicollinearity
is a VIF value of 10.
2. Interpretation of Specific VIFs:
age: The VIF of 1.018 indicates that the variance of the
coefficient estimates for the predictor variable “age” is inflated by a
factor of approximately 1.018 due to multicollinearity.
bmi: The VIF of 1.013 suggests a similar interpretation for
the predictor variable “bmi.”
children: The VIF of 1.004 suggests minimal multicollinearity
for the predictor variable “children.”
smoker: The VIF of 1.003 indicates minimal multicollinearity
for the predictor variable “smoker.”
Decision:
Since all VIF values are close to 1, it suggests
that multicollinearity is not a significant issue among the predictor
variables in the model.
This report developed a predictive model for medical insurance costs using demographic and health-related data. Through comprehensive data analysis, we identified key factors affecting costs, such as age, BMI, number of children, and smoking status. Visualizations revealed significant patterns, particularly higher charges for smokers. A linear regression model, validated rigorously, demonstrated strong predictive accuracy. The model’s application to new data showed its practical utility, aiding precise premium setting and financial risk management for health insurance companies. Overall, this project provides valuable insights for enhancing financial planning and operational efficiency in the healthcare insurance industry.